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ABSTRACT 

We discuss a novel approach to identifying cosmic events in separate and independent observations. 
In our focus are the true events, such as supernova explosions, that happen once, hence, whose 
measurements are not repeatable. Their classification and analysis have to make the best use of all 
the available data. Bayesian hypothesis testing is used to associate streams of events in space and 
time. Probabilities are assigned to the matches by studying their rates of occurrence. A case study of 
Type la supernovae illustrates how to use lightcurves in the cross-identification process. Constraints 
from realistic lightcurves happen to be well-approximated by Gaussians in time, which makes the 
matching process very efficient. Mo del- dependent associations are computationally more demanding 
but can further boost our confidence. 

Subject headings: methods: statistical — methods: data analysis — surveys — time — supernovae: 
general 



1. INTRODUCTION 

Advancements in detector technology have recently 
opened up new opportunities for time-domain astron- 
omy. The next-generation surveys will observe most 
of the sky not once but many times over, which en- 
ables for systematic studies of variable and transient 
sources. The first such programs are already online. 
One examp le is the Catalina Real-Time Transient Sur- 
vey fCRTS: [Drake et al.|[2009l ) that uses three dedicated 
telescopes, and has reported over two thousand new tran- 
sients to date. Another is the Palomar Transient Factory 
(PTE; iRau et al . 2009) that has discovered close to 800 
supernovae. The Pan-STARRS telescope that started 
survey operation in 2010, will see thousands of super- 
novae a month, no t to mention the Sky Mapper project 
(jKeller et al.ll2007[) or the Large Synoptic Survey Tele- 
scope (LSST) that aims to make the "greatest movie of 
the Universe" by covering most of the sky in every few 
days. 

Modern surveys implement automated discovery in 
their pipelines. They provide live streams of events in 
both machine and human readable formats to facilitate 
rapid follow-up observations and real-time analysis. An 
interoperable representation is the VOEvent format de- 
fined by the International Virtual Observatory Alliance 
(IVOA, see www.ivoa.net). Capitalizing on this stan- 
dard, the VOEventNet . org project promises to deliver 
alerts to interested subscribers within minutes of dis- 
covery. Another important reason for standardized live 
streams is the joint analysis of the events. Users of the 
SkyAlert . org site can browse for discoveries in a grow- 
ing selection of data streams. The project will automati- 
cally associate any events within 10 degrees and less than 
1 hour apart (Roy Williams, private communication). 

One outstanding issue is the proper statistical founda- 
tion of event aggregation. The methods developed for 
cross-identification of catalogs are not directly applica- 
ble. While positional information is clearly needed for 
transient events, too, one cannot statistically interpret 
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the spatial thresholds. To make a probabilistic state- 
ment one has to consider the density but the host objects 
might not be visible, only the explosions. In fact, events 
really are more abstract entities in spacetime, whose den- 
sity we have to study in 3-1-1 dimensions. Rigorous sta- 
tistical assessments are most important in the presence 
of crowding: large density compared to the uncertain- 
ties. Spatially the optical observations are typically much 
more accurate than the X-ray, radio, let alone gamma 
or gravitational wave detections. In time, this is espe- 
cially relevant considering that the frequency of events 
depends on the cutoff parameter of the detection. By 
lowering the threshold we can study fainter and more dis- 
tant events that are although low significance in any one 
survey, could be interesting when seen in many. In this 
paper, our goal is to derive the probabilistic matching 
formalism for transient events, which enables the auto- 
mated association of detections in space and time. 

In Section[2]we use Bayesian hypothesis testing to eval- 
uate the quality of candidate associations. Section [3] de- 
scribes why it is not impossible after all to derive prob- 
abilities for these matches. In Section 2] we analyze sim- 
ulated SNia lightcurves to derive time constraints, and 
show an accurate approximation that provides analytic 
results. Section E] concludes our study. Throughout the 
paper, the capital P and L symbols represent probabil- 
ities and likelihoods, respectively, and the lower-case p 
letters are probability densities. 

2. MATCHING EVENTS 

Many things happen in the sky. Asteroids move, stars 
pulsate, supernovae and gamma-ray bursts explode all 
over the place. Such astronomical events are typically 
discovered in photometric images based on the change in 
the apparent brightness between different epochs. The 
imaging cadence is strategically planned to meet the sci- 
entific goals. The epochs are optimized for the charac- 
teristic time-scale of the intended targets, which might 
be weeks, days, or just tens of minutes. Looking at re- 
peated observations, we can spot the significant changes, 
e.g., using a 5ct threshold. At that point, it is possible 
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go back to extract the lightcurve of the detected event 
from the earlier images. Variable sources are revisited 
several times over the course of a survey, which provides 
ample amount of data. Cosmic events, however, hap- 
pen just once and we need to make the best use of the 
observations. 

Considering that different photometric pass-bands 
potentially probe different physical processes, the 
lightcurves can be quite a bit different. The detections 
can be delayed from one instrument to another. One 
thing, however, is common: the start time of the event, 
which can be estimated in all detections by looking at 
the changing flux as function of time. 

Let us now consider a single candidate association that 
consists of n reported events, one from each survey i 
(where i—1 to n). They will have measured Dx = {xi} 
positions in the sky (unit vectors of the directions) and 
can constrain Dt = {ti} start times with some uncertain- 
ties. The time constraint is not the precision of our clock, 
but rather the ability of our model M of the given sur- 
vey to estimate the start time t of a particular event. 
Formally, it is a likelihood L{T\t) = p{t\T, M) that is a 
function of the model's r epoch. For now, let us assume 
that Li(r|-) is known for every survey. In Section|3]we 
take a closer look at how to derive such likelihood func- 
tions from observed lightcurves; in particular we analyze 
the Type la supernovae in detail. 

We ask the question whether the detections in the 
given tuple are from the same event or not using Bayesian 
hypothesis testing. We wish to compare two hypotheses. 
The first one, call it H, claims that all detections are 
of the same cosmic event, hence they have a common 
epoch (r) and a common true position (both unknown 
to us.) The second hypothesis is the former's comple- 
ment K that allows for any one of the detections to be 
from a separate event, which we model using a set of pa- 
rameters, e.g., the {Ti] epochs. The Bayes factor makes 
the comparison, which is the ratio of the probabilities 
of the measurements in the two hypotheses. For inde- 
pendent measurements, we can calculate separate Bayes 
factors for the position and time measurements, and the 
combination of the two is just their product. 
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The Bj. term is the s ame as for the stat i c obj ects as 
discussed in detail by iBudavari fc Szalavl ()20Q8I ). The 
numerator and the denominator of Bt are integrals of 
the known likelihood functions multiplied by the prior, 

^ n 

p{Dt\H)^ JdTp{T\H) 1[L,{t\U) (2) 

i 

n „ 

p(Dt\K)^\[ dn p{n\K) L,{n\u) (3) 

The choice of the prior naturally affects the Bayes factor, 
hence it requires some more consideration. 

Common sense dictates that the prior on the epoch r 
should be flat as the event can occur at any time with the 
same probability density. The calculation of the Bayes 
factor, however, requires a proper prior, whose integral 
is unity. Considering that all relevant measurements and 



the non-zero support of their uncertainties can be con- 
strained to a finite interval without any practical limita- 
tions, we can introduce a flat prior on an arbitrary but 
wide enough [T, T-l-A] interval, and define 

P^-^l ^ \ otherwise 

Using this flat prior, the likelihoods of the hypotheses 
become 

P n 

p{Dt\H) = A-' JdTl[LdT\k) (5) 
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p{Dt\K) = A-"l[ Jdn Hn\U) (6) 
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where we omitted the integration limits as they are ir- 
relevant for any large enough interval, and arrive at the 
dependence of 

Bt oc A"-i (7) 

This means that the wider we make the interval, the 
larger the Bayes factor becomes, which seems to artifi- 
cially go against our expectation for an objective quality 
measure of the associations. 

3. STREAMS OF EVENTS 

Cosmic events should not be looked at in isolation. A 
given telescope provides an entire stream of events, which 
is typically published online in an automated fashion. 
Our goal is to merge several independent streams of many 
surveys. The resulting combined stream will carry more 
information on the individual events, which would help 
with their scientific analysis. Along with our stream of 
associations we want to include the probabilities. 

The rate of events varies from survey to survey. For 
a given time interval the number of occurrences can be 
quite different. Indeed this is a key piece of information 
that enters the calculations. By definition the posterior 
probability of an association is computed from the Bayes 
factor and the prior of the hypothesis. If we assume that 
the prior probability is uniform, i.e., it takes the same 
value independent of the position in the sky, etc., we can 
write it as the ratio 
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just like in the static case. Here Ni is the number of 
events in the ith survey in a given time interval A, and 
Ni, is the number of events that appear in all surveys, 
i.e., the intersection of their selection functions. In our 
case it is only natural to use v frequencies instead, and 
substitute N ~ vA into the prior. 
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The probability shrinks with increasing time intervals, 
i.e., it is less probable to randomly select the same event 
from larger sets. In practice, the number of events is al- 
ways large, otherwise their frequency could not be mea- 
sured. More formally, A can be arbitrarily wide to yield 
large Ni counts and a small prior. For vanishing priors. 
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the posterior depends only on tlie product of the Pq prior 
and the Bayes factor 
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hence the dependence on A actually cancels out in the 
product of 
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The second term of the product, /?(, can be directly cal- 
culated from the data. The value of i/* in ttq can also 
be estimated over time from the ensemble statistics of 
the streams. We recall th e self-consistency argument of 
iBudavari fc Szalavl (|2008f ): the sum of the priors over the 
all possible Y\Ni combinations, which is A^* in total by 
definition, is also equal to the sum of the posteriors. This 
equality is an equation for Ni,, or Vi, in this case, that in 
turn provides well-determined prior and posterior prob- 
abilities. In summary, the posterior is well-defined and 
calculated as 

To select candidate associations early on, when not 
enough statistics are available to reliably solve for i/*, 
one can use the upper bound of jy™^'^ = minjz^i} to over- 
estimate the posteriors. Similarly if Vi are uncertain, one 
can use conservative limits instead. By safely overesti- 
mating the prior, we end up with somewhat larger pos- 
teriors than the actuals, thus a fix probability cut yields 
slightly more candidates that we can later prune. This 
will not overshoot by too much because large posteriors 
are n ot particularly sen sitive to small variations in the 
prior (iHeinis et al.|[2009l) . 

4. CONSTRAINTS FROM LIGHTCURVES 

One cannot directly measure the epoch of an event, 
only the fluxes as a function of time. If we can model 
the change, these lightcurves can provide the time con- 
straints. For example, a Type la supernova is described 
by its absolute magnitude M, redshift z and epoch r. For 
any given set of such parameters we can derive the fluxes 
at the times of pre-scheduled observations, and compare 
them to reality. The likelihood function is given by the 
deviation of the photometric measurements / and the 
simulated fluxes /(r, Af, z) as 
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where S represents the photometric errors in the normal 
distribution N. By integrating over AI and z with their 
proper priors, we can derive the t dependence needed for 

, , 

We use a Gaussian luminosity function (jPahlen et al.l 
[200l with M^=-19 and (Tm=0.5 for the prior, and 
simulate a typical M^, SNia using the LSST cadence 
(jlvezic et al.l 120081 ). In this cadence, pairs of obser- 
vations separated by an hour are repeated every 3 
days. Throughout this model we use precision LSST 
lightcurve templates from Peter Nugent (via Andy Con- 
nolly, private communication) but consistently neglect 
the K-correction. The error on the g-band photometry 
is determined from the 5a limiting magnitude of 25. The 



simulated event is set to happen at day. The luminosity 
function determines the kind of SN la events that can oc- 
cur, and the decline at high redshifts is set by the limiting 
magnitude of the survey. Together these set the integra- 
tion domain for our calculations. The peak brightness of 
the simulated supernova at redshift 2;=0.8 is 24.5 magni- 
tudes in the g band. This distance is approximately at 
the peak of the redshift distribution. 

Figure[l] shows the likelihood as a function of the epoch 
for a noiseless set of observations. The results are not 
sensitive to the end date as long as the integration limit 
is in the far tail of the decay. The measurements (open 
circles) follow a Gaussian surprisingly well. The solid 
line helps to guide the eye: the curve has a=l day, and 
a mean of 0.1 days. Random realizations of lightcurves 
with realistic noise yield similar proflles with maxima, 
whose distribution is consistent with the error. For alert- 
ing the transient community, the convention is to have 2 
independent observations that pass the detection thresh- 
old. For LSST-like cadences, these most likely will hap- 
pen on the same night 1 hour apart. As part of the 
notification message, a lightcurve can be published that 
consists of the measurements up to the last two 5ct obser- 
vations. The crosses show the results for such truncated 
lightcurves, which also closely follow a Gaussian with 
cr=3 days and a 0.5 day mean {dotted line). The shorter 
time baseline and the marginal photometry provide sig- 
nificantly less leverage on the time constraint. We also 
see more deviation, and a slight asymmetry in the points. 
The offset in the peak at these faint magnitudes is a prop- 
erty of the lightcurve's shape and the cadence. When the 
photometric errors are tighter, the shift (and the width) 
is (are) significantly smaller. In comparison, for a z=0.5 
supernova (with the same Af* luminosity) the brighter 
lightcurve yields a negligible offset and a ct~0.25 days. 
More frequent sampling also decreases the offset but it 
is not practical in real life. A better way is to calibrate 
the effect and correct for the shifts, if needed. 

Using Gaussian likelihoods the integrals in /3t can be 
calculated analytically. For a 2-way association it be- 



I3t = 



exp 



(^1 - t2)' 
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This is not a Bayes factor and is not dimensionless. Sub- 
stituting a=l day and At=0 yields f3f = l/y/in day^^, 
which is 0.282/day approximately. The numerical in- 
tegral of the photometric likelihood gives the similar 
0.287/day value, which is in good agreement with the 
analytical estimate. The left panel of Figure[2] illustrates 
f3t as a function of At for Gaussians with the aforemen- 
tioned a values. The solid line shows results from match- 
ing with the complete lightcurves and the dotted line is 
for alerts with truncated photometry. The dashed line 
illustrates the case of matching a full lightcurve versus 
an alert. The right panel shows the same results for a 
case, when the likelihoods are approximated by tophat 
functions with the same standard deviations. The finite 
support of these functions results in /3i=0/day at large 
separations in time, which is a cause for concern if the 
width of the window is not well-understood. For a z=0.5 
redshift SNia with cr=0.25 days the value is 4 times 
larger, /3t=1.128/day, when seen at the same time. 
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T [day] 

Fig. 1. — The likelihood as a function the epoch is calculated 
from the lightcurve of a simulated SNia, whose observations are 
in the LSST cadence. The solid line is a Gaussian that follows 
the computed open circles well. When the survey detects a 5cr 
transient, it issues an alert. The crosses and the dotted line show 
the same curve for a truncated lightcurve that is only available 
up until 2 detections arc over the threshold. The shorter baseline 
yields a weaker constraint. 

These values of f3t are not large, especially consid- 
ering that ttq is typically fairly small, because the 
time constraint are weak. The frequency of all events 
from a telescope can be tens of thousands, say up to 
i/~100,000/day, when including all transients. Since the 
true SNia rate is expected to be around z/* ~l,000/day 
in LSST, our estimate for the lower bound comes to 
ttq ~ 10~^ days. Nevertheless even for the fainter sources, 
we have higher Pt ~ 10~^ values than in the blind match- 
ing o f static sources, where Pq ~ 10~^° (jHeinis et al.l 
I200I . We can increase our prior by filtering the input 
streams, however, the filtering might be better done on 
the associations to find the low significance events in the 
merged stream. In practice, filtering will be applied both 
before and after for optimal performance. 

For completeness we mention that the general result for 
the n-way Gaussian case is also calculated analytically. 
The resulting formula is similarly simple and reads 

A = (2.)-./5^exp|(§^_^| (15) 

where Wi=l/crf are the precision parameters of the ti 
epochs. 

Photometric measurements can also be directly used to 
calculate using the multivariate likelihood function in 
eq. (|13p . In addition to a common epoch, identical events 
have the same physical properties as well, e.g., the abso- 
lute magnitude and the redshift of the supernova. Our 
hypotheses H and K remain the same as before, but 
we parameterize them with not only t, but also with M 
and z. To calculation of /3t involves computing the in- 
tegrals over all three parameters. Of course the numeri- 
cal integral is substantially more expensive computation- 



ally than evaluating an analytic formula but this is the 
proper use of the model that fully exploits its constrain- 
ing power. Following the same prescription as before, 
we generated two lightcurves using the LSST cadences. 
The only difference was in their phase: the cycle of one 
simulated survey was shifted by a day from the other. 
The result for an M^, supernova at 2;=0.8 redshifts is al- 
most 20 times larger than just based on time coincidence, 
/3t=5.3/day. If we are willing to live with the extra com- 
putation and the explicit dependence on a model, this 
way we can boost the probability, in this case, to almost 
1%. Naturally, one would still need spatial constraints 
for any meaningful match, but could potentially leverage 
lower accuracy positional measurements. 

The explicit modeling of the lightcurve provides an op- 
portunity to simultaneously perform classifications and 
outlier detection. We can benefit from calculating sev- 
eral Bayes factors or /3t values for comparison. Associa- 
tions that appear to be reliable matches based on their 
directions and epochs when using a model independent 
approximate time window, might be rejected based in 
a detailed analyses of their SEDs. Accidental noise or 
new discovery? Follow up observations might help to de- 
cide. Beyond the above outlier detection, we can also 
run several SED models to find out in which class the 
candidate event might fit best. Naturally there is also 
nothing to stop us from creating several merged streams 
using different strategies, which target specific scientific 
experiments and communities. 

5. CONCLUSIONS 

We introduced a new statistical method to associate cos- 
mic events based on measured constraints in both time 
and space. Folding in the time-domain information is 
important for a number of reasons: (1) If we can auto- 
matically cross-correlate streams of events from multiple 
surveys, we can find fainter ones that are not obvious 
in any one of the observations but together become sig- 
nificant. This enables us to push these studies to un- 
charted territories. (2) Constraints on the epoch can 
boost small positional evidences in case of low accura- 
cies, e.g., gamma-ray bursts. (3) A more fundamental, 
yet, extremely important practical matter is the assign- 
ment of probabilities. While we can clearly calculate 
Bayes factors based on positional information only, we 
cannot define the probabilities without the time-domain 
data. Conceptually the problem is with the prior. It 
cannot be determined without counting the occurrences 
of events in a fixed time interval, whose duration is for- 
mally arbitrary. We found that the solution is to fold 
in the Bayes factor in time, which naturally cancels the 
artifact. 

The lightcurve measurements constrain the epoch of 
a given event. For Type la supernovae, we found the 
likelihood to be very close to a Gaussian that makes 
the problem analytically tractable. The numerical sim- 
ulations agree with the fast approximations. The same 
method also works for streams of events seen in differ- 
ent wavelengths as long as regime is covered by the SED 
models. To unlock the full potential of the lightcurves 
one can further exploit the SED modeling. The calcula- 
tion is extended to require matching events to have the 
same physical properties. For SNia, the numerical inte- 
grals show that this strategy has the potential to boost 
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Fig. 2. — When crossmatching streams of events, the fit parameter is the time-domain analog of the Bayes factor. Its dependence 
on the time difference is plotted for the 2-way Gaussian case in the left panel with uncertainties seen in Figure[l] The line styles are 
identical to those in Figure[T] The dashed line is for matching one against the other. For comparison, the right panel shows the same using 
tophat likelihood approximations with the same standard deviations. The finite support of these functions means that /3t=0/day at large 
separations in time, which in turn yields posterior probability. The loss of associations, in this case, is a cause for concern, if the width 
of the time-window is not well-understood. 



the time-domain evidence by a factor of 20, which means 
that one can get away with less accurate spatial mea- 
surements. 

Although we illustrated the methodology on super- 
novae, it is clearly applicable to other types of events. 
Before g amma-ray b ursts were established to be extra- 
galactic, iLuo et all (|1996D applied Bayesian hypothesis 
testing to look for repeating gamma-ray bursts. Since 
then we have seen the host galaxy of many GRBs. Us- 
ing the time-domain cross-identification we can look for 
other most subtle events in their hosts from related phys- 
ical processes. Another example is the detection of gravi- 
tational waves. The LIGO Co llaboration discusse s astro- 
physically triggered searches () Abbot et al.l 120081 ) . Their 
idea is to lower the detection threshold for a short period 
of time, when other type of cosmic events occur. This is 
in fact a very similar problem: merging streams of events 
from LIGO and other projects with our probabilistic ap- 
proach could boost their significance and help facilitate 



new discoveries. 

The general cross-identification problem in astronomy 
is rather convoluted and ultimately tightly interwoven 
with the scientific analyses. One obvious example is clas- 
sification, which in turn would affect the matching prior. 
The circular nature of the problem is not an issue but one 
needs to find a consistent solution. The explicit assump- 
tions in the probabilistic approach enables us to incorpo- 
rate new, different kind of data on top of the epochs and 
the positions, when available. This is a crucial feature, 
which will prove vital in studying the time-domain and 
understanding the eventful Universe. 

The author would like to thank Andy Connolly, Alex 
Szalay, Istvan Csabai, Matthew Graham, Suvi Gezari, 
Josh Bloom, Joey Richards, Roy Williams and Tom 
Loredo for several stimulating discussions on various as- 
pects of the topic. This study was supported by the 
Gordon and Betty Moore foundation via GBMF 554. 



REFERENCES 



Abbott, B., et al. 2008, Class. Quantum Grav., 25 114051 
Budavari, T., & Szalay, A. S., 2008, ApJ, 679, 301 
Dahlen, T., et al. 2004, ApJ, 613, 189 
Drake, A. J., et al., 2009, ApJ, 696, 870 

Heinis, S., Budavari, T., & Szalay, A. S. 2009, ApJ, 705, 739 
Ivezic, Z., et al. 2008, Serbian Astronomical Journal, 176, 1 



Keller, S. C, et al. 2007, Publications of the Astronomical 

Society of Australia, 24, 1 
Luo, S., Loredo, T., & Wasserman, I. 1996, American Institute of 

Physics Conference Series, 384, 477 
Rau, A., et al. 2009, PASP, 121, 1334 



